1 Overview

This document illustrates the application of the three-stage methodology using national healthcare (or mortality) data to predict cancer incidence (CI) in geographical areas without a cancer registry. The method and the R code used are illustrated for two cancer sites, pancreatic cancer in men and melanoma in women, over the 2007-2015 period. This document details the different steps of calibration, validation and post-processing of the data. Its organisation follows the method section of the article.

Note that, in order to comply with legal constraints on data dissemination, the numbers of cancer and HC cases in the datasets were slightly altered; the results obtained here thus differ slightly from those presented in the article.

1.1 R-package

The material and R code used in this application are part of the R package CalibInc, available on GitHub. Use the devtools package to install CalibInc:

require(devtools)
install_github("echatignoux/CalibInc")

After installation, the package can be loaded into R.

library(CalibInc)

The tidyverse, magrittr, mgcv and spdep packages are also needed to run the following code. The spatial smoothing is done using the R-INLA package (see http://www.r-inla.org for installation instructions).

library(tidyverse)
library(magrittr)
library(mgcv)
library(spdep)
library(INLA)

1.2 Data sources

The application of the method is illustrated here for two cancer sites, pancreatic cancer in men and melanoma in women, for the 2007-2015 period.

To evaluate the calibration model, incidence and HC data for districts covered by a cancer registry are needed. They are provided in the table dt_calib.

To predict incidence in all districts, we need HC data for all the French districts. These data are in the table dt_pred.

2 Calibration step

2.1 Overview of the data

The data set used for calibration and validation consists of, for each district covered by a cancer registry, the number of cancer cases (N_I) and the number of patients retrieved in HC data (N_HC), aggregated by cancer site (site), district (dist), HC source (SRC) and 5-year age group (the age column is the median of the age group). The additional columns py and pop_WHO refer respectively to the number of person-years (summed over the 2007-2015 period) and to the 1960 WHO standard population.

The structure of the data is presented below:

set.seed(200)
dt_calib%>%sample_n(5)
## # A tibble: 5 x 8
##   site           dist  SRC     age   N_I  N_HC      py pop_WHO
##   <fct>          <fct> <fct> <dbl> <int> <dbl>   <dbl>   <dbl>
## 1 Melanoma-women d_13  HA     67.5    49    50 154338.    3000
## 2 Pancreas-men   d_20  HA     32.5     0     0  72027     6000
## 3 Melanoma-women d_7   M      57.5   139    14 397739     4000
## 4 Pancreas-men   d_4   H      17.5     0     0 157488.    9000
## 5 Melanoma-women d_7   H      32.5    48    20 381150.    6000

2.2 Evaluation of the calibration model

2.2.1 Estimation

The calibration model (model 1 in the manuscript) is fitted, for each cancer site and health-care proxy, with the gam function from the mgcv package. The log of the number of incident cancer cases is introduced as an offset in the model (thus, rows with no incident cancer cases are not used to fit the calibration model). The age effect is introduced as a thin-plate spline (s(age,bs="tp")), with knots placed at the distinct observed age values (knots=nk). The district effect is modelled as a random intercept (s(dist,bs="re")).

The model formula is specified as below:

nk<-list(age=unique(dt_calib$age)) 
form<-N_HC~offset(log(N_I))+s(age,bs="tp")+s(dist,bs="re") 

The models are then fitted separately for each cancer site and proxy:

Pancreas - men

dt_panc<-dt_calib%>%filter(site=="Pancreas-men")%>%droplevels()
m_panc_A<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="A",N_I>0),method="REML")
m_panc_H<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="H",N_I>0),method="REML")
m_panc_HA<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="HA",N_I>0),method="REML")
m_panc_M<-gam(form,family="quasipoisson",knots=nk,data=dt_panc%>%filter(SRC=="M",N_I>0),method="REML")

Melanoma - women

dt_mela<-dt_calib%>%filter(site=="Melanoma-women")%>%droplevels()
m_mela_A<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="A",N_I>0),method="REML")
m_mela_H<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="H",N_I>0),method="REML")
m_mela_HA<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="HA",N_I>0),method="REML")
m_mela_M<-gam(form,family="quasipoisson",knots=nk,data=dt_mela%>%filter(SRC=="M",N_I>0),method="REML")

2.2.2 Model checking

The quality of fit of the models can be visualized with the gam.check function. For the melanoma model with the A proxy, for example, no influential observation nor violation of the model assumptions is apparent.

gam.check(m_mela_A)

Similar results apply to the other models.

2.2.3 HC/I ratios

The shape of the ratios may be visualized with the plot.gam function, or obtained as predictions with predict.gam.

Using predict.gam, we may superimpose the modelled mean HC/I ratio (red curve) on the ratios observed in the districts covered by registries (grey points and lines):

Code for graphs of HC/I ratios - pancreas, men

## Predictions with N_I=1 (i.e. mean ratio) for each proxy
dt_rat<-tibble(age=unique(dt_calib$age),N_I=1,dist="d_1")
dt_rat$rat_A<-predict(m_panc_A,exclude="s(dist)",newdata=dt_rat,type="response")
dt_rat$rat_H<-predict(m_panc_H,exclude="s(dist)",newdata=dt_rat,type="response")
dt_rat$rat_HA<-predict(m_panc_HA,exclude="s(dist)",newdata=dt_rat,type="response")
dt_rat$rat_M<-predict(m_panc_M,exclude="s(dist)",newdata=dt_rat,type="response")

## Converting to long format
dt_rat<-dt_rat%>%
  gather(SRC,rat,rat_A:rat_M)%>%
  mutate(SRC=gsub("rat_","",SRC))

## Graph
dt_panc%>%
  ggplot(data=.)+
  geom_jitter(aes(age,N_HC/N_I,group=dist,size=N_I),alpha=0.2,height = 0,width = 0.5)+
  geom_line(aes(age,N_HC/N_I,group=dist),alpha=0.2)+
  geom_line(data=dt_rat,aes(age,rat),colour="red",size=1)+
  coord_cartesian(ylim=c(0,2),xlim=c(20,90))+
  facet_wrap(~SRC)+
  ylab("HC/I")+xlab("Age")+
  theme(legend.position = "bottom")+labs(size="# I")

Figure 2.1: HC/I ratio by age - pancreas, men


Figure 2.2: HC/I ratio by age - melanoma, women

The graphs first show that the heights and shapes of the HC/I ratios by age vary between cancer sites and proxies. Second, the spread of the district-specific ratios around the mean ratio also varies among proxies and cancer sites. Setting aside the younger ages, where ratios are highly variable due to small numbers of cases, district ratios generally look more tightly clustered around the mean ratio for pancreatic cancer in men than for melanoma in women.

This visual impression is confirmed by the values of the standard deviation \(\sigma_d\) of the district random effects, which are obtained via the gam.vcomp function (line ‘s(dist)’):

gam.vcomp(m_mela_A)
## 
## Standard deviations and 0.95 confidence intervals:
## 
##             std.dev       lower       upper
## s(age)  0.003441324 0.001382654 0.008565204
## s(dist) 0.173951134 0.119509734 0.253192741
## scale   1.034473744 0.945555772 1.131753365
## 
## Rank: 3/3

Applying this to each model shows clearly that district deviations from the mean calibration curves are much lower for pancreatic cancer in men than for melanoma in women.

Table 2.1: Between-district variability (\(\sigma_d\)) of the HC/I ratios

Cancer site    | A    | H    | HA   | M
Pancreas-men   | 0.02 | 0.06 | 0.02 | 0.05
Melanoma-women | 0.17 | 0.21 | 0.15 | 0.18

3 Predictions in a district

For each model, the predictions are obtained with the CalibInc function from the CalibInc package.

The code is illustrated below for prediction of melanoma cancer incidence in women, using H proxy.

3.1 Number of incident cases

For each district, the total number of cancer cases predicted with the H proxy (and the associated standard errors) is obtained from the calibration model m_mela_H and the H data as follows:

dt_mela_H<-dt_calib%>%filter(site=="Melanoma-women",SRC=="H")
dt_pred_mela_H<-CalibInc(mod=m_mela_H,pred=dt_mela_H,aggregate=~dist)
dt_pred_mela_H
## # A tibble: 17 x 4
##   dist   N_HC  pred    se
##   <fct> <dbl> <dbl> <dbl>
## 1 d_1     357  629. 144. 
## 2 d_2     160  283.  67.5
## 3 d_3     424  729. 166. 
## 4 d_5     296  527. 121. 
## 5 d_7     561  992. 224. 
## # ... with 12 more rows

The aggregate argument of the CalibInc function is used here to specify the aggregation level of the predictions (the district level here).

Prediction intervals at the 95% level may subsequently be added to the result table using the LogNormPI function from CalibInc.

dt_pred_mela_H%>%LogNormPI()
## # A tibble: 17 x 6
##   dist   N_HC  pred    se   low    up
##   <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 d_1     357  629. 144.   414. 1004.
## 2 d_2     160  283.  67.5  184.  462.
## 3 d_3     424  729. 166.   482. 1161.
## 4 d_5     296  527. 121.   346.  844.
## 5 d_7     561  992. 224.   657. 1573.
## # ... with 12 more rows

Similarly, the predicted numbers of incident cases by age and district may be obtained by modifying the aggregate argument:

CalibInc(mod=m_mela_H,pred=dt_mela_H,aggregate=~dist+age)%>%
  sample_n(5)%>%
  LogNormPI()
## # A tibble: 5 x 7
##   dist    age  N_HC  pred    se    low    up
##   <fct> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 d_2    37.5     6  13.6  6.86   6.00  38.7
## 2 d_12   37.5    10  22.7  9.41  11.2   53.6
## 3 d_5    37.5    17  38.6 13.4   21.0   79.3
## 4 d_20   52.5    11  23.4  9.43  11.8   53.9
## 5 d_20   22.5     0   0    0    NaN    NaN

3.2 Standardized incidence

Interest usually lies in standardized summaries of the incidence in the districts, such as standardized incidence rates or standardized incidence ratios.

3.2.1 Standardized incidence ratio

The standardized incidence ratio in a district \(j\) is calculated as the ratio of the predicted (\(\hat{I}_j=\sum_i \hat{I}_{ij}\)) to the expected total number of cases in the district: \(SIR_j=\frac{\hat{I}_j}{E_j}\).

The expected number of cases is obtained by applying the age-specific incidence rates of the reference population to the district population sizes, i.e. \(E_j=\sum_i \frac{\hat{I}_{i.}}{py_{i.}}\times py_{ij}\), where \(\hat{I}_{i.}\) and \(py_{i.}\) refer respectively to the total number of predicted cancer cases and person-years of age \(i\) in the reference population.
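As a toy numerical illustration (hypothetical numbers, base R only, not part of the CalibInc workflow), the expected count and the SIR of a district can be computed from these formulas as follows:

```r
## Toy example with three age groups (all numbers are hypothetical)
I_ref  <- c(10, 50, 100)    # predicted cases by age in the reference population
py_ref <- c(1e5, 2e5, 2e5)  # person-years by age in the reference population
I_j    <- c(2, 8, 20)       # predicted cases by age in district j
py_j   <- c(1e4, 2e4, 3e4)  # person-years by age in district j

rate_ref <- I_ref / py_ref      # age-specific rates in the reference population
E_j   <- sum(rate_ref * py_j)   # expected number of cases in district j
SIR_j <- sum(I_j) / E_j         # standardized incidence ratio
```

With these numbers, E_j equals 21 and SIR_j is about 1.43, i.e. incidence in district j is about 43% above what its population size and age structure would imply.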

To obtain standardized incidence ratios, we need to calculate the incidence rates by age in the reference population. If this population is taken to be the districts covered by a cancer registry, for example, we can use the following code to calculate SIRs from the predicted incidence:

## 1- Calculate age incidence rate
## Total number of incident cases by age
dt_mela_H_rate<-CalibInc(mod=m_mela_H,pred=dt_mela_H,aggregate=~age)

## Add total number of py by age and calculate the rate
dt_mela_H_rate<-dt_mela_H_rate%>%
  left_join(
    dt_mela_H%>%
      group_by(age)%>%
      summarise(py=sum(py)))%>%
  mutate(rate=pred/py)

## 2- Calculate expected number of cases
dt_mela_H_E<-dt_mela_H%>%
  left_join(
    dt_mela_H_rate%>%select(age,rate)
  )%>%
  group_by(dist)%>%
  summarise(E=sum(py*rate))

## 3- Add expected counts to predicted counts and calculate the SIR
dt_SIR_mela_H<-dt_pred_mela_H%>%
  left_join(
    dt_mela_H_E)%>%
  mutate(SIR=pred/E)

3.2.2 Standardized incidence rate

Standardized incidence rates \(T_j\) are calculated in a given area \(j\) as the sum of the incidence rates multiplied by the person-years of a standard population \(py^r_{i}\): \(\hat{T}_j=\sum_i \frac{\hat{I}_{ij}}{py_{ij}}\times py^r_{i}=\sum_i \hat{I}_{ij}\times \frac{py^r_{i}}{py_{ij}}\).
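As a toy numerical check (hypothetical numbers, base R only), the second form of the formula shows that the weights \(w_i = py^r_i/py_{ij}\) simply rescale the predicted counts before summation, which is what the weight argument of CalibInc exploits:

```r
## Toy example with three age groups (all numbers are hypothetical)
I_j    <- c(2, 8, 20)           # predicted cases by age in district j
py_j   <- c(1e4, 2e4, 3e4)      # person-years by age in district j
py_std <- c(12000, 7000, 4000)  # standard population by age (e.g. pop_WHO)

## Direct form: sum of rates times standard person-years
T_direct <- sum((I_j / py_j) * py_std)

## Weighted form: weights applied to the predicted counts
w <- py_std / py_j
T_weighted <- sum(I_j * w)
```

Both forms give exactly the same standardized rate, which is why passing w to CalibInc suffices.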

The CalibInc function has a weight argument that can be used for this purpose. To calculate incidence rates standardized to the WHO world population in each district covered by a cancer registry, one can use:

## Add weights to dt_mela_H table
dt_mela_H_T<-dt_mela_H%>%mutate(w=pop_WHO/py)
## Calculate standardized incidence rates in the districts using the weights
CalibInc(mod=m_mela_H,pred=dt_mela_H_T,weight=w,aggregate=~dist)
## # A tibble: 17 x 4
##   dist   N_HC  pred    se
##   <fct> <dbl> <dbl> <dbl>
## 1 d_1    6.31 12.2   2.83
## 2 d_2    5.98 11.6   2.87
## 3 d_3    7.93 15.3   3.53
## 4 d_5    7.34 14.2   3.33
## 5 d_7    5.16  9.98  2.27
## # ... with 12 more rows

4 Evaluation of the quality of the predictions and of their usefulness for epidemiology

To evaluate the quality of the predictions, we estimate, in the districts with a registry, the gaps between predicted and observed CI. This is done using a cross-validation procedure as follows: for each district \(j\) with a registry, predicted CI is obtained using the HC/I ratio estimated by model (1) fitted after excluding district \(j\); the prediction error (PE) between the total number of predicted cases (\(\hat{I}_{j}=\sum_i \hat{I}_{ij}\)) and the total number of observed incident cases (\(I_{j}=\sum_i I_{ij}\)) is then computed: \(PE=\frac{\hat{I}_{j}-I_{j}}{I_{j}}\)

The epidemiological usefulness of CI predictions is evaluated using two principles. First, predictions with large errors (i.e. large PE, see the previous paragraph) in districts with a cancer registry may be misleading and should be discarded. Second, predictions with errors of moderate size may still provide important information about the spatial variation of the cancer site if the between-district variation of CI is much larger than the errors (i.e., if the “signal-to-noise” trade-off remains favorable).

This signal-to-noise ratio is assessed by comparing the between-district variability of the HC/I ratio, which represents the “noise” or error, with the between-district variability of incidence, which represents the epidemiological signal. Both are measured in districts with a cancer registry. The former is the standard deviation \(\sigma_d\) of the district random effect in calibration model (1). The latter is the standard deviation \(\sigma_k\) of the district random effect in a Poisson mixed model fitted to the observed CI rates, with age modelled as a penalized regression spline.

The above key quantities (PE and \(\sigma_d\) vs \(\sigma_k\)) are then fed into a decision tree, depicted in Figure 1, for each cancer site and gender, to decide whether a given method (i.e. the use of a given proxy) provides CI predictions of adequate quality. When the absolute value of PE (APE) is low for all districts with a registry (i.e., below 15%), predictions are considered valid with this method (method rated A++). On the contrary, if at least one district has a high APE (i.e., larger than 30%) or if more than 20% of districts have large errors (APE between 15 and 30%), then the predictions are considered of insufficient quality (method rated B-). For intermediate-sized prediction errors (i.e., fewer than 20% of districts with an APE between 15 and 30%), predictions may be considered informative if geographical variations in CI rates are larger than the error (i.e., \(\sigma_k>2\times \sigma_d\), which translates into a correlation between predicted and true incidence of >0.9) (method rated A+). Methods rated A++ or A+ are considered suitable to predict CI in all districts.
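The decision tree can be sketched as a small helper function. This is a hypothetical illustration (rate_method is not part of CalibInc), using the thresholds stated above:

```r
## Sketch of the decision rule (hypothetical helper, not in CalibInc).
## `ape`: vector of absolute prediction errors (%) over the districts;
## `sd_d`, `sd_k`: between-district sd of the HC/I ratio and of incidence.
rate_method <- function(ape, sd_d, sd_k) {
  if (all(ape < 15)) return("A++")           # low errors everywhere
  if (any(ape > 30) ||                       # at least one large error, or
      mean(ape >= 15 & ape <= 30) > 0.20)    # too many moderate errors
    return("B-")
  if (sd_k > 2 * sd_d) return("A+")          # favourable signal-to-noise
  "B-"
}

rate_method(c(5, 8, 12), sd_d = 0.02, sd_k = 0.08)          # "A++"
rate_method(c(rep(5, 19), 17.5), sd_d = 0.02, sd_k = 0.08)  # "A+"
rate_method(c(rep(5, 16), 33), sd_d = 0.15, sd_k = 0.18)    # "B-"
```

The second call mimics the pancreas A/I situation below (one district with a moderate error, favourable \(\sigma_d\) vs \(\sigma_k\)); the third mimics a melanoma-like case with one APE above 30%.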

4.1 Predictions errors in cross-validation

For each district in the registry area, the calibration model is fitted without that district, and the total number of incident cases is then predicted for it. For pancreatic cancer in men, district “d_1” and proxy “A”, we use the following code:

m_panc_cv_d1_A<-gam(form,family="quasipoisson",knots=nk,
       data=dt_panc%>%filter(SRC=="A",dist!="d_1",N_I>0),method="REML")
cv_d1_A<-CalibInc(m_panc_cv_d1_A,pred=dt_panc%>%filter(SRC=="A",dist=="d_1"),aggregate=~1)
cv_d1_A
## # A tibble: 1 x 3
##    N_HC  pred    se
##   <dbl> <dbl> <dbl>
## 1   357  536.  33.3

The relative difference is then computed by comparing the prediction to the observed number of cases:

cv_d1_A%>%
  bind_cols(dt_panc%>%filter(SRC=="A",dist=="d_1")%>%summarise(N_I=sum(N_I)))%>%
  mutate(PE=100*(pred-N_I)/N_I)
## # A tibble: 1 x 5
##    N_HC  pred    se   N_I    PE
##   <dbl> <dbl> <dbl> <int> <dbl>
## 1   357  536.  33.3   514  4.27

These two steps are then repeated for each district in the registry area.

Code to compute complete cross-validation - pancreas, men

cv_panc<-dt_panc%>%
  group_by(dist,SRC)%>%
  nest()%>%
  mutate(cv=pmap(list(dist,SRC,data),
                 function(d,src,dt){
    m<-gam(form,family="quasipoisson",knots=nk,
           data=dt_panc%>%filter(SRC==src,dist!=d,N_I>0),method="REML")
    CalibInc(m,pred=dt%>%mutate(dist=d),aggregate=~1)
                 }))%>%
  unnest(cv)%>%
  select(-N_HC)%>%
  unnest(data)%>%
  group_by(dist,SRC)%>%
  summarise(pred=mean(pred),se=mean(se),N_I=sum(N_I),N_HC=sum(N_HC))%>%
  mutate(PE=100*(pred-N_I)/N_I)

Code to compute complete cross-validation - melanoma, women

cv_mela<-dt_mela%>%
  group_by(dist,SRC)%>%
  nest()%>%
  mutate(cv=pmap(list(dist,SRC,data),
                 function(d,src,dt){
    m<-gam(form,family="quasipoisson",knots=nk,
           data=dt_mela%>%filter(SRC==src,dist!=d,N_I>0),method="REML")
    CalibInc(m,pred=dt%>%mutate(dist=d),aggregate=~1)
                 }))%>%
  unnest(cv)%>%
  select(-N_HC)%>%
  unnest(data)%>%
  group_by(dist,SRC)%>%
  summarise(pred=mean(pred),se=mean(se),N_I=sum(N_I),N_HC=sum(N_HC))%>%
  mutate(PE=100*(pred-N_I)/N_I)

The results are stored in the tables cv_panc and cv_mela for pancreas and melanoma cancers, respectively, and are printed below in tables 4.1 and 4.2.

For pancreatic cancer in men, prediction errors are relatively small for the HA/I method (all below 10% in absolute value), whereas for the other methods, errors are generally moderate but sizeable in some districts (PE=17.5% for district d_2 with the A/I method, 16.4% and -16.7% for districts d_14 and d_15 with H/I, and 18.8% for district d_20 with the M/I method).

Table 4.1: Detailed cross-validation in districts for pancreas, men
Columns are grouped by method (A/I, H/I, HA/I, M/I); for each method, the table reports N_HC, the predicted total \(\hat{I}_j\) and the prediction error PE (%).

dist N_I | A/I: N_HC \(\hat{I}_j\) PE | H/I: N_HC \(\hat{I}_j\) PE | HA/I: N_HC \(\hat{I}_j\) PE | M/I: N_HC \(\hat{I}_j\) PE
d_1 514 357 536 4.3 484 516 0.4 591 527.7 2.7 510 553.5 7.7
d_2 293 230 344.3 17.5 258 273.3 -6.7 330 294.1 0.4 279 304.8 4
d_3 581 378 568.8 -2.1 560 601.5 3.5 650 581.7 0.1 548 579.9 -0.2
d_4 444 297 434.1 -2.2 381 404.9 -8.8 474 421.6 -5.1 390 413.9 -6.8
d_5 426 297 447.9 5.1 381 405.5 -4.8 459 409.2 -3.9 371 399.6 -6.2
d_6 632 408 602.1 -4.7 574 612.1 -3.2 714 637.2 0.8 622 675.9 6.9
d_7 956 666 984.8 3 944 1011 5.8 1111 993.7 3.9 813 874.7 -8.5
d_8 1088 726 1073.8 -1.3 951 1011.4 -7 1174 1045.9 -3.9 919 983.5 -9.6
d_9 965 685 1034.2 7.2 845 899.5 -6.8 1053 938.6 -2.7 834 899.9 -6.7
d_10 954 637 945.8 -0.9 838 890.3 -6.7 1046 930.9 -2.4 845 907.1 -4.9
d_11 402 275 415.2 3.3 387 414.2 3 465 415.6 3.4 390 413.2 2.8
d_12 659 419 603.2 -8.5 657 702.4 6.6 766 683.4 3.7 603 652.8 -0.9
d_13 589 387 574.1 -2.5 556 593.8 0.8 648 576.9 -2 552 602 2.2
d_14 514 314 464.4 -9.6 555 598.3 16.4 621 557.2 8.4 490 522.9 1.7
d_15 301 210 326.2 8.4 235 250.8 -16.7 310 278 -7.6 274 286.4 -4.9
d_16 420 294 420.3 0.1 445 475.1 13.1 523 466.1 11 427 472 12.4
d_17 360 225 340.4 -5.5 334 358.9 -0.3 379 338.9 -5.9 318 340.2 -5.5
d_18 559 375 551.4 -1.4 564 604.9 8.2 643 575.2 2.9 554 594.8 6.4
d_19 293 205 303.9 3.7 282 302.1 3.1 320 286 -2.4 282 302.6 3.3
d_20 250 165 241.2 -3.5 242 259.4 3.8 289 258.3 3.3 276 297 18.8
Total 11200 7550 11212 17.5/3.6 10473 11185.3 16.7/6.2 12566 11216.3 11/3.3 10297 11076.7 18.8/5.8
Note: At the line “Total”, maximum and median absolute prediction error is reported in the PE column.

For melanoma, prediction errors are much larger with every method, with sizeable errors in many districts (median absolute PE > 12%) and very large errors in some (at least one district with an absolute PE above 30% for each method).

Table 4.2: Detailed cross-validation in districts for melanoma, women
Columns are grouped by method (A/I, H/I, HA/I, M/I); for each method, the table reports N_HC, the predicted total \(\hat{I}_j\) and the prediction error PE (%).

dist N_I | A/I: N_HC \(\hat{I}_j\) PE | H/I: N_HC \(\hat{I}_j\) PE | HA/I: N_HC \(\hat{I}_j\) PE | M/I: N_HC \(\hat{I}_j\) PE
d_1 558 514 644.6 15.5 357 633.9 13.6 666 637.3 14.2 91 766.1 37.3
d_2 315 199 245.6 -22 160 281.4 -10.7 267 252.8 -19.8 30 269 -14.6
d_3 635 451 547 -13.9 424 737 16.1 664 618.7 -2.6 85 665.6 4.8
d_5 425 352 442.6 4.1 296 535 25.9 478 462.9 8.9 57 440 3.5
d_7 1339 772 945.8 -29.4 561 976 -27.1 1018 954.3 -28.7 127 994.7 -25.7
d_8 690 551 683.5 -0.9 358 647.5 -6.2 715 691.3 0.2 116 1028.4 49
d_9 1345 920 1160.5 -13.7 786 1437.9 6.9 1316 1289 -4.2 108 857 -36.3
d_10 1458 1137 1423.6 -2.4 1015 1860.8 27.6 1627 1576.5 8.1 168 1375.1 -5.7
d_11 460 489 607.6 32.1 358 614.5 33.6 664 620.1 34.8 95 690.3 50.1
d_12 838 499 621.3 -25.9 252 428.4 -48.9 638 608.6 -27.4 87 652.2 -22.2
d_13 802 546 691.8 -13.7 347 606.8 -24.3 701 676.9 -15.6 86 654.8 -18.4
d_15 278 253 315.3 13.4 181 320.9 15.4 327 310.7 11.7 37 295.2 6.2
d_16 376 379 488 29.8 198 345.1 -8.2 452 441 17.3 60 543.8 44.6
d_17 300 239 297.1 -1 175 310.4 3.5 317 300.2 0.1 40 255.2 -14.9
d_18 606 560 697.7 15.1 317 543.9 -10.2 672 628.8 3.8 89 724.9 19.6
d_19 309 231 285.5 -7.6 183 316.9 2.6 307 289.1 -6.4 35 279.1 -9.7
d_20 314 315 403 28.4 173 296.4 -5.6 368 353.3 12.5 30 245.3 -21.9
Total 11048 8407 10500.5 32.1/13.9 6141 10892.9 48.9/13.6 11197 10711.4 34.8/11.7 1341 10736.7 50.1/19.6
Note: At the line “Total”, maximum and median absolute prediction error is reported in the PE column.

The following section presents the code used to draw graphical representations of the cross-validation results that are not shown in the article. It is not essential and may be skipped by the reader.

Prediction errors in cross-validation may usefully be displayed as funnel plots, which allow a rapid examination of the extent of the errors given their precision (Spiegelhalter 2005). In these funnel plots, the precision of an error is defined as the inverse of its coefficient of variation, and 95% confidence limits under the null hypothesis of zero error are drawn.

Code for funnel plot - pancreas, men

## Define the 95% confidence limits given the range of the precisions (1/cv)
## of the errors (number of predicted cases follow a log-normal distribution).
level<-0.95
z <- qnorm(1 - (1 - level)/2)
cvs<-range(bind_rows(cv_panc,cv_mela)%>%mutate(cv=se/pred)%$%cv)
fun_bound<-tibble(cv=seq(cvs[1],cvs[2],length=100),pred=100,se=cv)%>%
  mutate(low = pred * sqrt(cv^2 + 1) * exp(-z * sqrt(log(cv^2 + 1))),
         up = pred * sqrt(cv^2 + 1) * exp(+z * sqrt(log(cv^2 + 1))))

## Plot the errors (point size proportional to the number of incident cases)
## and add confidence limits
cv_panc%>%
  ggplot(data=.)+
  geom_point(aes(PE,pred/se,size=N_I),alpha=0.2)+
  geom_line(data=fun_bound,aes(up-100,1/se))+
  geom_line(data=fun_bound,aes(low-100,1/se))+
  facet_wrap(~SRC)+
  coord_cartesian(xlim=c(-50,50))+
  theme(legend.position="bottom")+
  ylab("1/cv")

The funnel plots for pancreatic cancer in men first show that all districts lie within or very close to the 95% confidence limits; thus, no district appears influential or an outlier. Second, prediction errors are larger for the H/I and M/I methods (i.e. lower precision on the y axis) than for A/I and HA/I. These results are in line with the estimates of table 2.1 (the between-district variability of the ratios, \(\sigma_d\), is higher for the H/I and M/I methods).


Figure 4.1: Funnel plot of prediction errors in cross-validation - pancreas, men

The funnel plots for melanoma in women show a very different pattern. Prediction errors are much larger for every method, and the precision is rather poor.


Figure 4.2: Funnel plot of prediction errors in cross-validation - melanoma, women

Another useful representation is to plot observed and predicted standardized indicators of incidence. We may for example plot the observed and predicted SIR (along with its 95% PI) for each district, with districts arranged in increasing order of SIR, using a so-called caterpillar plot. This type of representation makes it easier to appreciate the effect of the errors on the indicators of interest.

Code for Caterpillar plot - pancreas, men

## Calculate expected number of cases in each district in the registry area
E_panc<-dt_panc%>%
  group_by(SRC)%>%
  nest()%>%
  mutate(E=pmap(list(SRC,data),
                 function(src,dt){
                   ## Get calibration model
                   m<-get(paste0("m_panc_",src))
                   ## Calculate incidence rates by age in the registry area
                   E<-dt%>%
                     group_by(age)%>%
                     mutate(w=1/sum(py))%>%
                     CalibInc(m,pred=.,aggregate=~age,weight=w)
                   ## Add district's py and sum py*rate by district
                   E%>%
                     left_join(dt%>%select(dist,age,py))%>%
                     group_by(dist)%>%
                     summarise(E=sum(pred*py))
                 }))%>%
  unnest(E)
## Add expected to predictions in cross-validation and 95% PI
SIR_panc_cv<-E_panc%>%left_join(cv_panc%>%select(SRC,dist,pred,se))%>%
  LogNormPI()
## Calculate the ratio
SIR_panc_cv<-SIR_panc_cv%>%mutate(pred=pred/E,se=se/E,low=low/E,up=up/E)

## Idem with observed incidence
SIR_panc_obs<-dt_panc%>%
  group_by(SRC,age)%>%
  mutate(tx=sum(N_I)/sum(py))%>%
  group_by(SRC,dist)%>%
  summarise(O_obs=sum(N_I),E_obs=sum(py*tx),SIR_obs=O_obs/E_obs)

## Plot the results
SIR_panc_cv<-SIR_panc_cv%>%
  left_join(SIR_panc_obs)%>%
  group_by(SRC)%>%
  arrange(SRC,SIR_obs)%>%
  mutate(rg=row_number())

SIR_panc_cv%>%
  ggplot(data=.)+
  geom_point(aes(rg-0.2,SIR_obs,size=O_obs,color="Obs."),alpha=0.3)+
  geom_pointrange(aes(rg+0.2,pred,ymin=low,ymax=up,color="Pred."))+
  facet_wrap(~SRC)+
  geom_hline(aes(yintercept=1))+
  xlab("District")+ylab("SIR")+
  labs(size="# I",colour="")+
  scale_x_continuous(breaks=unique(SIR_panc_cv$rg),labels=unique(SIR_panc_cv$dist),minor_breaks =NULL)+
  theme(axis.text.x=element_text(angle=90,hjust = 1,vjust=.5),legend.position="bottom",
        panel.grid.major.x = element_blank())

For pancreas, for example, we can see that the 17.5% prediction error in district d_2 observed with the A/I method (see table 4.1) results in a predicted SIR above 1.1 (i.e. excess incidence in the district), whereas the observed SIR is in fact just below 1 (i.e. a slight under-incidence).


Figure 4.3: Caterpillar plot of SIRs predicted in cross-validation - pancreas, men

For melanoma, it is clear from the plot that the errors lead to many diverging conclusions about which districts lie above or below the mean incidence level of the districts covered by registries.


Figure 4.4: Caterpillar plot of SIRs predicted in cross-validation - melanoma, women

4.2 Comparison of the variability of the HC/I ratio with the geographical variability of incidence over the districts

The geographical variability of incidence over the districts (\(\sigma_k\)) is obtained by fitting a Poisson mixed model to the observed incidence rates, with age introduced as a continuous effect (modelled with a penalized regression spline) and district as a random intercept (with standard deviation \(\sigma_k\)).

The Poisson model for incidence rate has the following formulation

nk<-list(age=unique(dt_calib$age)) 
form_i<-N_I~offset(log(py))+s(age,bs="tp")+s(dist,bs="re") 

and is applied to the two cancer sites under evaluation.

m_i_panc<-gam(form_i,family="quasipoisson",method="REML",
              data=dt_calib%>%filter(site=="Pancreas-men",SRC=="H"))
m_i_mela<-gam(form_i,family="quasipoisson",method="REML",
              data=dt_calib%>%filter(site=="Melanoma-women",SRC=="H"))

Values of \(\sigma_k\) are again extracted from the models using the gam.vcomp function.

4.3 Decision rule

The quantities calculated in the validation steps detailed in sections 4.1 and 4.2 can finally be summarised via the application of the decision rule, recalled below.


Figure 4.5: Decision rule for identifying cancer sites for which predictions of district-level incidence based on the HC/I ratio are valuable.

The criteria are summarised for each method and cancer site in the table below:

Table 4.3: Distribution of the APE and between-district variability of the HC/I ratios (\(\sigma_d\)) vs. geographical variability of incidence (\(\sigma_k\))

For each method (A, H, HA, M), the APE columns give the number of districts with APE>30% / the number of districts with APE between 15 and 30%.

Cancer site    | # districts | APE: A, H, HA, M    | \(\sigma_k\) | \(\sigma_d\): A, H, HA, M
Pancreas-men   | 20          | 0/1, 0/2, 0/0, 0/1  | 0.08         | 0.02, 0.06, 0.02, 0.05
Melanoma-women | 17          | 1/8, 2/8, 1/6, 5/10 | 0.18         | 0.17, 0.21, 0.15, 0.18

For pancreas in men, as already noted, predictions with the HA/I method have small errors in all districts (below 15%), and the method is eligible for prediction (labelled A++). For the A/I, M/I and H/I methods, respectively one, one and two districts have intermediate-sized errors (between 15 and 30%), so these methods are not eligible on this criterion alone. However, looking at the signal-to-noise trade-off (i.e. \(\sigma_d\) vs. \(\sigma_k\)), the between-district variability of the A/I ratio remains small compared to the geographical variation of incidence (0.02 vs. 0.08), and the method is suitable for prediction (labelled A+). For the H/I and M/I methods, on the contrary, the variability of the ratios is too high (0.06 and 0.05 vs. 0.08), and these methods are not suitable for prediction (labelled B-). In this application, the chosen prediction method is HA/I.

For melanoma in women, the errors are too large with all methods (APE>30% in 1 to 5 districts depending on the method), which are all labelled B-. No valuable prediction of incidence can thus be made for this cancer site.

5 Prediction of incidence in all districts and disease mapping, pancreas-men

For pancreatic cancer in men, we saw in the previous sections that incidence can be predicted using the HA/I method with acceptable prediction errors. Useful predictions of incidence, as detailed in section 3, can thus be produced for all districts using the HA/I method.

In order to visualise the geographical gradient of predicted incidence, we may represent the standardized incidence ratios on a map.

5.1 Prediction of incidence and SIR calculation

The first step is to calculate the SIRs as in section 3.2.1.

Code to compute SIR

dt_pred_panc<-dt_pred%>%filter(site=="Pancreas-men",SRC=="HA")

## Prediction of total number of cases by district + PI
P<-CalibInc(mod=m_panc_HA,pred=dt_pred_panc,aggregate=~dist)%>%
  LogNormPI()

## Expected number of cases by districts
## (make use of weight argument to calculate incidence rate by age)
E<-dt_pred_panc%>%
  group_by(age)%>%
  mutate(w=1/sum(py))%>%
  CalibInc(mod=m_panc_HA,pred=.,aggregate=~age,weight=w)%>%
  select(age,pred)%>%
  rename(rate=pred)%>%
  right_join(dt_pred_panc%>%select(dist,age,py))%>%
  group_by(dist)%>%
  summarise(E=sum(rate*py))

## Calculate SIR
SIR_panc<-P%>%left_join(E)%>%
  mutate(SIR=pred/E,low=low/E,up=up/E)

SIR_panc
## # A tibble: 95 x 8
##   dist   N_HC  pred    se   low    up     E   SIR
##   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 01      555  496.  29.7 0.842  1.06  525. 0.944
## 2 02      619  551.  31.5 1.01   1.27  486. 1.13 
## 3 03      453  406.  26.5 0.900  1.16  398. 1.02 
## 4 04      199  178.  16.8 0.802  1.16  185. 0.960
## 5 05      180  162.  16.0 0.893  1.32  150. 1.08 
## # ... with 90 more rows

5.2 Graphical representation

The spatial coordinates of the 95 French administrative districts are available in the CalibInc package as a SpatialPolygons object: dep_fr_95. Note that the Corse “département” appears as a single geographical unit, “20”: the two entities “2A” and “2B” are merged under this single unit in the data.

Mapping of the SIR may then be achieved using the utility function ggMap:

SIR_panc%>%rename(id=dist)%>%## id is the binding key between the spatial coordinates dep_fr_95 and the SIR_panc data
  ggMap(data=.,SIR,map=dep_fr_95,
        color="RdYlGn",rev=T,limits=c(0.7,1.3))

Figure 5.1: Maps of predicted standardized incidence ratio for pancreas cancer in men, 2007-2015.

5.3 Smoothing of SIR

5.3.1 Setting up BYM2 model

The Bayesian spatial model used to smooth the SIR is fitted with the INLA package, using the “bym2” prior.

To specify the “bym2” prior, we first need to create a matrix that specifies which districts share a common boundary (the adjacency matrix). This is done by applying the nb2INLA function from the spdep package to the neighbourhood structure of the French districts, extracted from dep_fr_95 with the poly2nb function. The adjacency matrix is stored in the working directory (file “graph.inla”).

library(INLA)
dep_fr_95.nb<-spdep::poly2nb(dep_fr_95)
spdep::nb2INLA("graph.inla",dep_fr_95.nb)

The districts are sorted in “graph.inla” according to their order in dep_fr_95.nb (the same order as in the SIR_panc data). We thus need to convert the dist variable into a rank variable in order to match the predictions with the adjacency matrix:

SIR_panc<-SIR_panc%>%
  mutate(dist_rank=as.numeric(as.factor(dist)))

Secondly, we need to specify that the predictions are log-normally distributed. To do so, we take advantage of the correspondence between the log-normal and the normal distribution to convert our predictions to a Gaussian scale (see (Chatignoux et al. 2019), Supplementary Material, section 3). Having a prediction distributed according to a log-normal distribution with mean \(\mu\) and standard error \(se\) is equivalent to having a log prediction that follows a normal distribution with variance \(s=\log(1+(se/\mu)^2)\) and mean \(\log(\mu)+s/2\).

We thus apply this transformation to our predictions, so we can use a gaussian distribution to fit the model.

SIR_panc<-SIR_panc%>%
  mutate(cv=se/pred,s=log(1+cv^2),lp=log(pred)+s/2)
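As an illustrative aside (not part of the original workflow), the variance identity \(s=\log(1+(se/\mu)^2)\) can be checked by simulation, using the standard log-normal parametrisation in which the simulated draws have mean \(\mu\) and standard deviation \(se\):

```r
## Illustrative check of the log-normal/normal correspondence:
## simulate a log-normal variable with mean mu and standard error se,
## and verify that the variance of its log equals log(1 + (se/mu)^2)
set.seed(1)
mu <- 500; se <- 30
s  <- log(1 + (se/mu)^2)
## standard parametrisation: meanlog = log(mu) - s/2 gives E(x) = mu
x  <- rlnorm(1e6, meanlog = log(mu) - s/2, sdlog = sqrt(s))
c(mean(x), sd(x)) ## close to c(500, 30)
var(log(x))       ## close to s
```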

Thirdly, we have to specify a prior distribution for the hyper parameters \(\sigma\) and \(\phi\) of the model.

The uniform prior for \(\phi\) is easily specified with a Beta(1,1) distribution. For \(\sigma\), the parameters are specified in INLA in terms of precision, i.e. \(1/\sigma^2\). We used a gamma prior distribution for the precision, with parameters mimicking a uniform distribution for \(\sigma\) between 0.01 and 0.4 (the usual range of geographical variability previously observed for incidence in France). To do so, we empirically calculate the mean and variance of the precision that correspond to the uniform distribution for \(\sigma\), and convert them into the shape and rate parameters of the gamma distribution.

This is done with the calc.prec function:

calc.prec<-function(a1,b1){
  sigma.v0<-runif(n=10000,min=a1,max=b1)
  tau.v<-1/sigma.v0^2
  a2<-mean(tau.v)^2/var(tau.v)
  b2<-a2/mean(tau.v)
  c(a2,b2)
}
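As an illustrative check (our addition, not part of the original workflow), we can draw precisions from the gamma prior returned by calc.prec and inspect them on the standard-deviation scale; note that the moment-matched gamma only approximates the target uniform distribution for \(\sigma\):

```r
## Illustrative check: draw precisions from the gamma prior returned
## by calc.prec and look at the implied standard deviations
set.seed(1)
p   <- calc.prec(0.01, 0.4)                     # c(shape, rate)
tau <- rgamma(1e5, shape = p[1], rate = p[2])   # precision draws
summary(1/sqrt(tau))                            # implied sigma draws
```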

We now have all the elements to write down the model’s formulae:

formBym2<-lp~offset(log(E))+
  f(dist_rank,model="bym2",graph="graph.inla", scale.model = TRUE,constr=T,
    hyper=list(
      phi=list(prior="logitbeta",param=c(1,1)),
      prec=list(prior="loggamma",param=calc.prec(0.01,0.4))
    ))

To fit the model, we lastly need to specify that the distribution has a fixed and known variance. This can be done by tuning the control.family option (the precision is fixed at 1, via initial = 0 on the log scale) and the scale argument of the inla function (set to the known precision 1/s):

fit_bym2<-inla(formBym2, family ="normal", data=SIR_panc,
               control.family = list(hyper =list(prec = list(fixed = TRUE, initial = 0))),
               scale=1/s,control.predictor=list(link=1,compute=T),
               control.compute=list(cpo=TRUE,dic=T,config=T,waic = TRUE))

5.3.2 Extract model parameters

The posterior estimates of the model’s key parameters \(\sigma\) and \(\phi\) are given in the results of the inla function:

fit_bym2$summary.hyperpar
##                                mean         sd  0.025quant    0.5quant  0.975quant        mode
## Precision for dist_rank 277.0621786 81.2501652 152.7982783 265.2178085 468.5115954 243.1931861
## Phi for dist_rank         0.8437502  0.1231704   0.5285776   0.8754864   0.9875436   0.9636108

For \(\sigma\), the posterior is given in terms of precision (\(1/\sigma^2\)). To convert it to the standard error scale, we can sample from the posterior distribution of the precision and transform the draws:

post_prec<-fit_bym2$marginals.hyperpar$"Precision for dist_rank"
post_sd<-inla.rmarginal(1000,inla.tmarginal(function(x) 1/sqrt(x), post_prec))
c(mean(post_sd),quantile(post_sd,p=c(0.025,0.975)))
##                  2.5%      97.5% 
## 0.06191034 0.04665856 0.08038263

The estimate \(\sigma\) of the variability of the SIR across all France’s districts (0.06 [0.05;0.08]) is close to the variability that we estimated on the registry districts only (\(\sigma_k\) = 0.08). Most of this variation is explained by the spatially structured component (\(\phi\)= 0.84 [0.53;0.98]).

5.3.3 Get smoothed predictions

Posterior estimates of the predicted number of cases are retrieved by taking the exponential of the model’s fitted values. This can be done with the following function:

get_pred<-function(model){
  exp.marg<-lapply(model$marginals.fitted.values,function(m) inla.tmarginal(function(x) exp(x), m))
  exp.marg<-t(sapply(exp.marg,function(em) inla.zmarginal(em,silent = TRUE)%>%do.call("c",.)))
  tibble(fit=exp.marg[,1],se.fit=exp.marg[,2],fit.low=exp.marg[,3],fit.up=exp.marg[,7])
}

We apply the function to the model and add the smoothed predictions to the original data:

SIR_panc<-SIR_panc%>%bind_cols(get_pred(fit_bym2))

5.3.4 Map of smoothed SIR

ggMap can again be used to map the smoothed SIR:

SIR_panc%>%rename(id=dist)%>%
  mutate(SIR_smooth=fit/E)%>%
  ggMap(data=.,SIR_smooth,map=dep_fr_95,
        legend = list(title="Smoothed\nSIR"),
        color="RdYlGn",rev=T,limits=c(0.7,1.3))

Figure 5.2: Maps of predicted standardized incidence ratio for pancreas cancer in men, 2007-2015.

5.3.5 Comparison between smoothed and crude SIR

Compared to the map of “crude” SIR (figure 5.1), the map of smoothed SIR looks more homogeneous, showing a west-east gradient in the incidence of pancreatic cancer in men.

Given the large share of variation explained by the spatially structured component of the model, the predicted incidence in districts with atypical SIR compared to their neighbours has been shrunk toward the neighbours’ mean. For example, in the north-east of France, the map of crude SIR showed districts with low incidence (SIR<1) sharing a boundary with districts of high incidence (SIR>1). In the smoothed map, these diverging patterns largely disappear, providing a more realistic picture of the underlying risk surface.

It is important to recall that the amount of smoothing is higher for districts with uncertain crude predictions. This can be seen in figure 5.3, which represents crude and smoothed SIR on a funnel plot (see section 4.1). Each point represents a district: crude SIR are shown as red dots, smoothed SIR as green triangles, and a line joins the crude and smoothed estimates of the same district. The most uncertain crude SIR (at the bottom of the graph) clearly undergo a larger amount of shrinkage than the others.

Code for the graph on the effects of smoothing on point and precision estimates - Funnel plot - pancreas, men

dt_fun_SIR_panc<-SIR_panc%>%
  mutate(SIR_smooth=fit/E)%>%
  rename(SIR_crude=SIR,low_crude=low,up_crude=up,cv_crude=cv)%>%
  mutate(low_smooth=fit.low/E,up_smooth=fit.up/E,cv_smooth=se.fit/fit)

level<-0.95
z <- qnorm(1 - (1 - level)/2)
cvs<-range(dt_fun_SIR_panc$cv_crude,dt_fun_SIR_panc$cv_smooth)*c(0.9,1.2)
fun_bound<-tibble(cv=seq(cvs[1],cvs[2],length=100),pred=1,se=cv)%>%
  mutate(low = pred * sqrt(cv^2 + 1) * exp(-z * sqrt(log(cv^2 + 1))),
         up = pred * sqrt(cv^2 + 1) * exp(+z * sqrt(log(cv^2 + 1))))


dt_fun_SIR_panc%>%
  gather(var,val,SIR_smooth,SIR_crude,low_crude,up_crude,low_smooth,up_smooth,cv_smooth,cv_crude)%>%
  separate(var,c("var","type_est"))%>%
  spread(var,val)%>%
  ggplot()+
  geom_point(aes(SIR,1/cv,colour=type_est,shape=type_est),size=2)+
  geom_segment(data=dt_fun_SIR_panc,
               aes(y=1/cv_crude,yend=1/cv_smooth,x=SIR_crude,xend=SIR_smooth),colour="grey",size=.3)+
  geom_line(data=fun_bound,aes(up,1/se))+
  geom_line(data=fun_bound,aes(low,1/se))+
  labs(colour="Prediction type",shape="Prediction type")+
  theme(legend.position="bottom")

Figure 5.3: Effects of smoothing on point and precision estimates - Funnel plot - pancreas, men

The preceding plot also shows that the smoothed SIR are more precise than the crude SIR. Indeed, as smoothing borrows information from all the districts to consolidate the estimation of the SIR in a given district, point estimates are less variable. This is better seen in caterpillar plots comparing crude and smoothed estimates (figure 5.4).

Code for the caterpillar plot of crude and smoothed SIR - pancreas, men

SIR_panc%>%
  mutate(SIR_smooth=fit/E)%>%
  rename(SIR_crude=SIR,low_crude=low,up_crude=up)%>%
  mutate(low_smooth=fit.low/E,up_smooth=fit.up/E)%>%
  gather(var,val,SIR_smooth,SIR_crude,low_crude,up_crude,low_smooth,up_smooth)%>%
  separate(var,c("var","type_est"))%>%
  spread(var,val)%>%
  group_by(type_est)%>%
  arrange(SIR)%>%
  mutate(id=row_number())%>%
  qplot(data=.,id,SIR)+geom_linerange(aes(ymin=low,ymax=up))+
  facet_wrap(~type_est)+
  geom_hline(yintercept=1)+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())+
  xlab("Districts")

Figure 5.4: Caterpillar plot of crude and smoothed SIR - pancreas, men, 2007-2015

6 References

Chatignoux, Édouard, Laurent Remontet, Jean Iwaz, Marc Colonna, and Zoé Uhry. 2019. “For a sound use of health care data in epidemiology: evaluation of a calibration model for count data with application to prediction of cancer incidence in areas without cancer registry.” Biostatistics (Oxford, England) 20 (3): 452–67. https://doi.org/10.1093/biostatistics/kxy012.

Spiegelhalter, David J. 2005. “Funnel plots for comparing institutional performance.” Statistics in Medicine 24 (8): 1185–1202. https://doi.org/10.1002/sim.1970.